The class distribution of the smaller datasets matches the class distribution of the complete dataset. We performed a preliminary ablation analysis with one of the datasets, the NIH Chest X-ray dataset, to understand which blocks of ResNet-50 the intermediate loss should be applied to. This preliminary ablation study gave evidence that applying the intermediate loss to all blocks yielded superior results.
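The intermediate-loss setup the ablation refers to can be sketched as deep supervision: an auxiliary classification loss is attached to the features of each ResNet-50 stage and the losses are summed. The feature sizes, random projection heads, and equal weights below are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the true class
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def deep_supervision_loss(block_features, labels, heads, weights):
    """Sum a weighted auxiliary loss over every block's features.

    block_features: list of (batch, dim_i) arrays, one per ResNet stage
    heads:          list of (dim_i, num_classes) projection matrices
    weights:        per-block loss weights (hypothetical choice)
    """
    total = 0.0
    for feats, head, w in zip(block_features, heads, weights):
        total += w * cross_entropy(feats @ head, labels)
    return total

rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 1])
dims = [64, 128, 256, 512]          # illustrative stage widths
feats = [rng.normal(size=(4, d)) for d in dims]
heads = [rng.normal(size=(d, 3)) * 0.01 for d in dims]
loss = deep_supervision_loss(feats, labels, heads, [0.25] * 4)
```

Applying the loss only to a subset of blocks, as in the ablation, amounts to zeroing the corresponding weights.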
MultiClaimNet: A Massively Multilingual Dataset of Fact-Checked Claim Clusters
Panchendrarajan, Rrubaa, Míguez, Rubén, Zubiaga, Arkaitz
In the context of fact-checking, claims are often repeated across various platforms and in different languages, which can benefit from a process that reduces this redundancy. While retrieving previously fact-checked claims has been investigated as a solution, the growing number of unverified claims and the expanding size of fact-checked databases call for alternative, more efficient solutions. A promising solution is to group claims that discuss the same underlying facts into clusters to improve claim retrieval and validation. However, research on claim clustering is hindered by the lack of suitable datasets. To bridge this gap, we introduce \textit{MultiClaimNet}, a collection of three multilingual claim cluster datasets containing claims in 86 languages across diverse topics. Claim clusters are formed automatically from claim-matching pairs with limited manual intervention. We leverage two existing claim-matching datasets to form the smaller datasets within \textit{MultiClaimNet}. To build the larger dataset, we propose and validate an approach involving retrieval of approximate nearest neighbors to form candidate claim pairs and an automated annotation of claim similarity using large language models. This larger dataset contains 85.3K fact-checked claims written in 78 languages. We further conduct extensive experiments using various clustering techniques and sentence embedding models to establish baseline performance. Our datasets and findings provide a strong foundation for scalable claim clustering, contributing to efficient fact-checking pipelines.
- Health & Medicine > Therapeutic Area > Immunology (0.68)
- Government > Regional Government > Europe Government (0.46)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Communications > Social Media (0.93)
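The candidate-pair retrieval and clustering pipeline the MultiClaimNet abstract describes can be sketched in miniature: embed claims, keep each claim's nearest neighbours above a similarity threshold, and take connected components of the resulting pair graph as clusters. The toy embeddings, `k`, and threshold below are illustrative stand-ins, not the paper's actual models or settings.

```python
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def candidate_pairs(emb, k=2, threshold=0.9):
    """For each claim, keep its k nearest neighbours above a similarity threshold."""
    sims = cosine_sim(emb, emb)
    np.fill_diagonal(sims, -1.0)   # exclude self-matches
    pairs = set()
    for i in range(len(emb)):
        for j in np.argsort(sims[i])[::-1][:k]:
            if sims[i, j] >= threshold:
                pairs.add((min(i, int(j)), max(i, int(j))))
    return pairs

def cluster(n, pairs):
    """Union-find: connected components of the claim-pair graph are clusters."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Toy embeddings: claims 0 and 1 are near-duplicates, claim 2 is unrelated.
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
labels = cluster(3, candidate_pairs(emb, k=1, threshold=0.9))
```

At 85.3K claims the exhaustive similarity matrix would be replaced by an approximate nearest-neighbor index, and the threshold decision by an LLM judgment, as the abstract describes.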
Tab2Visual: Overcoming Limited Data in Tabular Data Classification Using Deep Learning with Visual Representations
Mamdouh, Ahmed, El-Melegy, Moumen, Ali, Samia, Kikinis, Ron
This research addresses the challenge of limited data in tabular data classification, particularly prevalent in domains with constraints like healthcare. We propose Tab2Visual, a novel approach that transforms heterogeneous tabular data into visual representations, enabling the application of powerful deep learning models. Tab2Visual effectively addresses data scarcity by incorporating novel image augmentation techniques and facilitating transfer learning. We extensively evaluate the proposed approach on diverse tabular datasets, comparing its performance against a wide range of machine learning algorithms, including classical methods, tree-based ensembles, and state-of-the-art deep learning models specifically designed for tabular data. We also perform an in-depth analysis of factors influencing Tab2Visual's performance. Our experimental results demonstrate that Tab2Visual outperforms other methods in classification problems with limited tabular data.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Asia > Middle East > Kuwait (0.04)
- Africa > Middle East > Egypt (0.04)
- Research Report > Promising Solution (1.00)
- Research Report > New Finding (1.00)
- Overview (1.00)
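The core idea of Tab2Visual, rendering a tabular row as an image so that image models and their augmentations apply, can be illustrated with a minimal sketch. The bar-style layout below is a hypothetical rendering choice, not necessarily the paper's encoding.

```python
import numpy as np

def row_to_image(row, mins, maxs, size=32):
    """Render one tabular row as a grayscale image of vertical bars.

    Each feature becomes an equal-width bar whose height encodes the
    min-max-scaled feature value (an illustrative layout; the paper's
    exact encoding may differ).
    """
    scaled = (row - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    img = np.zeros((size, size), dtype=np.uint8)
    width = size // len(row)
    for f, v in enumerate(scaled):
        h = int(round(v * size))
        if h > 0:
            img[size - h:, f * width:(f + 1) * width] = 255
    return img

X = np.array([[0.0, 5.0, 10.0],
              [10.0, 0.0, 5.0]])
mins, maxs = X.min(axis=0), X.max(axis=0)
images = np.stack([row_to_image(r, mins, maxs) for r in X])
```

Once rows are images, standard augmentations (shifts, noise, cutout) create plausible extra samples, which is how the approach attacks data scarcity.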
Exploring Transfer Learning for Deep Learning Polyp Detection in Colonoscopy Images Using YOLOv8
Vazquez, Fabian, Nuñez, Jose Angel, Fu, Xiaoyan, Gu, Pengfei, Fu, Bin
Deep learning methods have demonstrated strong performance in object detection tasks; however, their ability to learn domain-specific applications with limited training data remains a significant challenge. Transfer learning techniques address this issue by leveraging knowledge from pre-training on related datasets, enabling faster and more efficient learning for new tasks. Finding the right dataset for pre-training can play a critical role in determining the success of transfer learning and overall model performance. In this paper, we investigate the impact of pre-training a YOLOv8n model on seven distinct datasets, evaluating their effectiveness when transferred to the task of polyp detection. We compare whether large, general-purpose datasets with diverse objects outperform niche datasets with characteristics similar to polyps. In addition, we assess the influence of dataset size on the efficacy of transfer learning. Experiments on the polyp datasets show that models pre-trained on relevant datasets consistently outperform those trained from scratch, highlighting the benefit of pre-training on datasets with shared domain-specific features.
- North America > United States > Texas > Hidalgo County > Edinburg (0.04)
- Asia > China > Fujian Province > Fuzhou (0.04)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (0.66)
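The pre-train/fine-tune protocol the paper evaluates can be illustrated with a toy linear model standing in for YOLOv8n: fit weights on a larger "source" dataset, then continue training on a small "target" set and compare against training from scratch. All data and hyperparameters below are synthetic.

```python
import numpy as np

def mse_grad_step(w, X, y, lr=0.1):
    """One gradient-descent step on mean-squared error for a linear model."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def train(w, X, y, steps, lr=0.1):
    for _ in range(steps):
        w = mse_grad_step(w, X, y, lr)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

# "Pre-training" dataset: large, drawn from a related task.
X_src = rng.normal(size=(200, 2))
y_src = X_src @ (w_true + 0.1)           # related but not identical weights
w_pre = train(np.zeros(2), X_src, y_src, steps=100)

# Small target dataset (the analogue of the polyp data).
X_tgt = rng.normal(size=(10, 2))
y_tgt = X_tgt @ w_true
w_fine = train(w_pre, X_tgt, y_tgt, steps=20)            # fine-tune
w_scratch = train(np.zeros(2), X_tgt, y_tgt, steps=20)   # from scratch
```

The benefit of pre-training here hinges on the source task being close to the target, mirroring the paper's finding that domain-relevant pre-training datasets transfer best.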
Reviews: High-Quality Self-Supervised Deep Image Denoising
Pros:
- The Bayesian analysis with different noise models is interesting.
- The ablation study is carefully done and confirms the importance of the central-pixel integration at test time. This is an important result and may be used in future work.
- I also find it interesting that performance is not degraded much when the noise level is unknown. It suggests the potential for image denoising using only single instances of corrupted images as training data.
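The blind-spot idea the review praises, predicting each pixel from its neighbours only and then fusing that prediction with the noisy centre value at test time, can be sketched with a neighbour-mean predictor and a Gaussian posterior mean in place of the paper's learned network; the noise levels and image below are illustrative.

```python
import numpy as np

def blindspot_estimate(img):
    """Predict each pixel from its 8 neighbours only (the 'blind spot')."""
    padded = np.pad(img, 1, mode="edge")
    acc = np.zeros_like(img, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # never look at the centre pixel itself
            acc += padded[1 + dy:1 + dy + img.shape[0],
                          1 + dx:1 + dx + img.shape[1]]
    return acc / 8.0

def integrate_center(pred, noisy, sigma_noise, sigma_prior):
    """Gaussian posterior mean: fuse the blind-spot prediction with the
    noisy centre value, weighted by their inverse variances."""
    w = sigma_prior ** 2 / (sigma_prior ** 2 + sigma_noise ** 2)
    return pred + w * (noisy - pred)

rng = np.random.default_rng(1)
clean = np.ones((16, 16)) * 100.0
noisy = clean + rng.normal(0, 10, size=clean.shape)
denoised = integrate_center(blindspot_estimate(noisy), noisy, 10.0, 5.0)
```

The fusion step is the "central pixel integration" the ablation isolates: discarding the centre entirely wastes its signal, while trusting it fully returns the noisy input.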
Alleviating Overfitting in Transformation-Interaction-Rational Symbolic Regression with Multi-Objective Optimization
Transformation-Interaction-Rational is a representation for symbolic regression that limits the search space of functions to the ratio of two nonlinear functions, each defined as the linear regression of transformed variables. The main objective of this representation is to bias the search towards simpler expressions while keeping the approximation power of standard approaches. The performance of Genetic Programming with this representation was substantially better than with its predecessor (Interaction-Transformation) and ranked close to the state of the art on a contemporary Symbolic Regression benchmark. On a closer look at these results, we observed that performance could be further improved with additional selective pressure for smaller expressions when the dataset contains just a few data points. The introduction of a penalization term applied to the fitness measure improved the results on these smaller datasets. One problem with this approach is that it introduces two additional hyperparameters: i) a criterion for when the penalization should be activated, and ii) the amount of penalization applied to the fitness function. In this paper, we extend Transformation-Interaction-Rational to support multi-objective optimization, specifically the NSGA-II algorithm, and apply it to the same benchmark. A detailed analysis of the results shows that multi-objective optimization benefits the overall performance on a subset of the benchmarks while keeping the results similar to the single-objective approach on the remainder of the datasets. Specifically on the small datasets, we observe a small (and statistically insignificant) improvement of the results, suggesting that further strategies must be explored.
- North America > United States > New York > New York County > New York City (0.04)
- South America > Brazil > São Paulo (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
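The ranking step at the heart of NSGA-II, extracting the non-dominated front over the two objectives (prediction error and expression size), removes the need for the two penalization hyperparameters the abstract mentions. A minimal sketch; the candidate error/size pairs are hypothetical.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly
    better in at least one. Both objectives are minimised:
    (prediction error, expression size)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """First non-dominated front, the core ranking step of NSGA-II."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical candidate expressions: (validation error, size in nodes).
candidates = [(0.10, 30), (0.12, 12), (0.30, 5), (0.11, 40), (0.12, 20)]
front = pareto_front(candidates)
```

Selection then prefers earlier fronts, so smaller expressions survive whenever they are not strictly worse, replacing the hand-tuned penalization with an implicit trade-off.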
Data Augmentation Techniques for Chinese Disease Name Normalization
Cui, Wenqian, Fu, Xiangling, Liu, Shaohui, Gu, Mingjun, Liu, Xien, Wu, Ji, King, Irwin
Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.
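In the same spirit as the paper's approach (though not its actual modules, and using English stand-ins rather than Chinese disease names), a minimal augmentation step might swap one component of a disease name for a synonym while keeping the standardized label unchanged:

```python
import random

# Hypothetical synonym table for disease-name components; the paper's
# augmentation techniques and supporting modules are more elaborate.
SYNONYMS = {
    "chronic": ["long-term"],
    "gastritis": ["stomach inflammation"],
    "acute": ["sudden-onset"],
}

def augment(name, rng):
    """Create a surface variant of a disease name by swapping one component
    for a synonym; the standardized label stays the same."""
    tokens = name.split()
    candidates = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    if not candidates:
        return name
    i = rng.choice(candidates)
    tokens[i] = rng.choice(SYNONYMS[tokens[i]])
    return " ".join(tokens)

rng = random.Random(0)
pairs = [("chronic gastritis", "K29.5")]   # (raw name, standard code)
augmented = [(augment(n, rng), code) for n, code in pairs]
```

Each augmented pair is a new training example mapping a plausible surface form to the same target, which is exactly how augmentation mitigates the data shortage.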
Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression
Six, Valentin, Chidiac, Alexandre, Worlikar, Arkin
This document is a replication of the original "Rank-N-Contrast" (arXiv:2210.01189v2) paper published in 2023. This evaluation is done for academic purposes. Deep regression models often fail to capture the continuous nature of sample orders, creating fragmented representations and suboptimal performance. To address this, we reproduced the Rank-N-Contrast (RNC) framework, which learns continuous representations by contrasting samples by their rankings in the target space. Our study validates RNC's theoretical and empirical benefits, including improved performance and robustness. We extended the evaluation to an additional regression dataset and conducted robustness tests using a holdout method, where a specific range of continuous data was excluded from the training set. This approach assessed the model's ability to generalise to unseen data and achieve state-of-the-art performance. This replication study validates the original findings and broadens the understanding of RNC's applicability and robustness.
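The contrastive objective being replicated can be sketched directly from its definition: for each anchor-positive pair, the positive is contrasted against all samples ranked at least as far from the anchor in label space. The numpy version below is a simplified single-batch sketch, not the authors' implementation; the features, labels, and temperature are illustrative.

```python
import numpy as np

def rnc_loss(features, labels, temperature=2.0):
    """Simplified Rank-N-Contrast loss over one batch.

    For anchor i and 'positive' j, the denominator ranges over all
    samples whose label distance to i is at least that of j.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    n = len(labels)
    dist = np.abs(labels[:, None] - labels[None, :])
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mask = (dist[i] >= dist[i, j])   # ranked no closer to i than j
            mask[i] = False
            denom = np.sum(np.exp(sim[i][mask]))
            total += -np.log(np.exp(sim[i, j]) / denom)
            count += 1
    return total / count

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
labels = np.linspace(0.0, 1.0, 8)
loss = rnc_loss(feats, labels)
```

Minimizing this loss pushes representation similarity to follow label ordering, which is the continuity property the replication's holdout tests probe.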
Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models
Kislay, Kaustubh, Singh, Shlok, Joshi, Soham, Dutta, Rohan, Flint, Jay Shim George, Zhu, Kevin
Symbolic Regression remains an NP-Hard problem, with extensive research focusing on AI models for this task. Transformer models have shown promise in Symbolic Regression, but performance suffers with smaller datasets. We propose applying k-fold cross-validation to a transformer-based symbolic regression model trained on a significantly reduced dataset (15,000 data points, down from 500,000). This technique partitions the training data into multiple subsets (folds), iteratively training on some while validating on others. Our aim is to provide an estimate of model generalization and mitigate overfitting issues associated with smaller datasets. Results show that this process improves the model's output consistency and generalization, yielding a relative improvement in validation loss of 53.31% and potentially enabling more efficient and accessible symbolic regression in resource-constrained environments.
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.64)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.62)
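The k-fold procedure the abstract describes is straightforward to sketch: shuffle the sample indices, split them into k disjoint folds, and train/validate k times, each time holding out a different fold. The toy least-squares model below stands in for the transformer; the data is synthetic.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split n sample indices into k disjoint folds after shuffling."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k, fit, score):
    """Train on k-1 folds, validate on the held-out fold, k times."""
    folds = kfold_indices(len(X), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return scores

# Toy model: least-squares linear fit, scored by mean-squared error.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
score = lambda w, X, y: float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5])
scores = cross_validate(X, y, k=5, fit=fit, score=score)
```

The spread of the k validation scores is what gives the generalization estimate on small datasets; averaging them is more stable than a single train/validation split.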